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1. Introduction 


In many fields of application it’s common to be interested in spatially-related phenomena 
and in particular to deal with attributes which are defined on continuous spatial domains. In 
this framework, if the design-based approach is assumed, the attribute is usually expressed as 
a function y(s) taking values on a suitable subset s of the plane. In the simplest case y(s) 
represents the value of an attribute at the location s. As an example, in forestal surveys y(s) 
could be the amount of biomass measured in sampled sites over a forestal area; in environmental 
studies y(s) could be the quantity of plastic materials collected by net tows in sampled areas 
over seas; etc... . 

Technology development has led to a growing availability of low-cost spatial data ready- 
to-use, frequently derived from large scale observations (i.e. data from pervasive systems like 
GPS sensors, or remote sensing data from earth observation technologies). Oftentimes, these 
data can’t directly answer specific questions posed by researchers and data users, or even if 
they can they are subject to measurement errors or self-selection bias. In both cases it is still 
necessary to rely, at least partially, on ad-hoc probabilistic surveys. On the other hand, the 
precision and quality of surveys estimates can be improved by using the data derived from these 
new sources as auxiliary information in the design phase and/or in the estimation phase. 

Geographical data generally show a spatial pattern and an uneven spatial distribution over 
the population. In fact, usually spatial observations are not mutually independent and tend to be 
more similar to their neighbours. As stated by Tobler’s first law of geography (Tobler, 1970): 
“everything is related to everything else, but near things are more related that distant things”. 
This arises because nearby units interact with one another and tend to be influenced by the same 
set of natural and anthropogenic factors. 

In such situations, it is well known that to estimate a mean or a total of a target variable 
selecting the units spatially best spread allows to collect more information and consequently 
provides better estimation. An important problem of sampling is thus to spread at best the 
sampled units in space. When, in addition to the spatial allocation, the value of one or more 
auxiliary variables is known for all the population units over the spatial domain, exploiting this 
information in the sampling design could further improve survey estimates. 

A well-spread sample is usually said to be spatially balanced. Different types of spatially 
balanced sampling designs have been suggested in literature for sampling spatial population. 
Many, but not all of them, allow the use of auxiliary information, in a more or less simple way, 
during the units’ selection procedure. For example various types of multi-phase systematic 
designs are used in different countries to produce National Forest Inventories for their forest 
monitoring programs. Tillé (2020, Chapter 8), Tillé and Wilhelm (2017), Benedetti et al. (2012) 
and Wang et al. (2012) give a review of the main spatial sampling methods. Since we are 
focusing on data that come from large scale observation (i.e. remote sensing data) to produce 
estimates at large scale, in the following we will focus on balanced sampling designs that can 
be easily implemented for big datasets. 

We consider several sampling strategies, based on the spatially Balanced Sampling through 
Local Pivotal Method (LPM) introduced by Grafström et al. (2012), in order to identify the 
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one which exploits geographical location and other sources of information to produce estimates 
for a spatially-related phenomenon in a more cost-efficient way. A strategy which could be 
globally applied by accounting for different areas characteristics in both the study and auxiliary 
variables, as well as for the differences in their relation. In all but one of the strategies under 
evaluation the sampling scheme consists of a different variation of the LPM, and therefore 
a single-phase non-informative sampling design is implemented. In addition, we propose an 
informative design which is based on a sequential use of the LPM and draws the final sample in 
two (or more) steps: (i) in the first step we collect an initial sample of observations on the target 
variable, which is used to investigate the relation between the auxiliary and study variables; (ii) 
then, this relation is exploited to target and tailor the subsequent sampling step; (iii) additional 
steps can be included by applying the procedure iteratively; (iv) finally, observations on the 
target variable collected in all the steps are used in the estimation process of the population 
mean. 

The performance of the different strategies is investigated through Monte Carlo experiments 
by considering several scenarios, which differ in the distributions of the auxiliary and study 
variables and in their relation. 


2. Sampling methods 


Usually, in a spatial setting, the population units are plots or cells of a grid overlapping an 
area of interest. A value, y;, of a variable of interest is associated with each unit i(i = 1,..., N) 
of the population. Moreover for each unit the spatial location s;, s € R? is known. Here, in 
addition we assume to know the value x; of an auxiliary variable for each unit of the population. 

For drawing a spatial sample from such a population we decided to consider as starting point 
the spatially Balanced Sampling through Local Pivotal Method (LPM) introduced by Grafström 
et al. (2012) since it is a flexible spatially balanced design that can draw equal and unequal prob- 
ability samples in multiple dimensions. Unequal probability sampling can be more efficient than 
equal probability sampling if there is a positive correlation between the inclusion probabilities 
and the response values. Additional dimensions could include any auxiliary information in 
addition to the spatial coordinates. 

The basic idea of LPM is to avoid that units close in distance appear together in the sample. 
First an inclusion probability 0 < m; < 1 is assigned to each unit so that their sum over the pop- 
ulation is equal to the fixed sample size. The sample is then obtained in at most N steps, where 
N is the population size. At each step one unit 7 is selected randomly from the available popula- 
tion and another unit 7 is chosen among the remaining units in the population by minimizing a 
distance function among them. This can be a univariate or a multivariate function that measures 
the distance with respect to one or more auxiliary variables, among which we can include the 
spatial coordinates. When all the variables are continuous the Euclidean distance is commonly 
used. Moreover, when multiple auxiliary variables are used, they are usually standardized or 
scaled in order to balance the contribution of each variable. After the selection of the unit 7 and 
j their inclusion probabilities are updated by using the following rule: 
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As a result, in each step at least one unit is removed from the population frame, either be- 
cause its probability becomes zero, and consequently it is definitely excluded from the sample, 
or because its probability becomes one and therefore is included in the sample. The procedure 
continues, updating at each step the probabilities of inclusion obtained in the previous step, 
until all units in the population are processed. LPM selects the units with the same probability 
7,S initially assigned to them, therefore the population mean can be estimated with the usual 
Horvitz-Thompson estimator. 

The following specific LPM based sampling designs, which differ in how they exploit loca- 
tion and auxiliary information, have been investigated: 


1. SpatLPM: The original formulation of the spatially balanced sampling through LPM 
which produces samples that are well spread in the geographic space and is based on 
equal inclusion probabilities. 

2. AuxLPM: Sampling, with equal inclusion probabilities, balanced through LMP in the 
space spanned by the auxiliary variable. 


3. BivLPM: Sampling, with equal inclusion probabilities, balanced through LMP in the 
space spanned by both the geographical coordinates and the auxiliary variable. 


4. UneqLPM: Spatially balanced sampling through LMP with unequal inclusion probabili- 
ties 7,8 proportional to the auxiliary variable. 


5. StrPropAuxLPM and StrNeyAuxLPM: Stratified sampling with AuxLPM design in 
each stratum. The area of interest is partitioned in sub-areas (strata) and then the AuxLPM 
is applied in each stratum. Two allocation rules are considered: Proportional and Ney- 
man’s with respect to the variance of the auxiliary variable. 


6. StrPropBivLPM and StrNeyBivLPM: Stratified sampling with BivLPM design in each 
stratum. The same stratification designs described in the previous point, but with BivLPM 
applied in each stratum. 

7. SeqUneqLPM: First an initial UneqLPM sample of size no < n (n is the final size of the 
sample) is selected and used to investigate the relation between the auxiliary and study 
variables, specifically to estimate the parameters of a generalized additive model (GAM); 
then the predicted values of Y are used to draw the remaining sample units with spatially 
balanced sampling through LMP with unequal inclusion probabilities 7;s proportional 
to predicted values; finally an Horvitz-Thompson-type estimator is applied to produce a 
mean estimation that exploits the data collected in both steps. 


A possible alternative to the LPM method could be the double balanced sampling of Graf- 
ström and Tillé (2012), however this sampling design is highly computationally demanding 
when applied to big datasets, and was unfeasible in our experiments. Conversely, LPM design 
has been optimized for large datasets using k-d trees (Lisic and Cruze, 2016), allowing to run 
our Monte Carlo experiments in a reasonable amount of time. 


3. Simulation study 


We investigate the performance of the different sampling designs through Monte Carlo ex- 
periments based on several synthetic datasets. In each of them the auxiliary (X) and response 
(Y) variables are drawn from a stationary bivariate spatial process [X (s), Y (s)| with s € [0, 10]? 
(1000 x 1000 grid). Following Diggle and Ribeiro (2007, Chapter 3), each bivariate spatial pro- 
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cess in turn is obtained as: 


X(s) = f(a x Zi(s) +c * Zo(s)) + ky 
Y (s) = g(b* Zı(s) + d * Z3(s)) + k2 


where: 


e Z1(s), Z2(s), Z3(s) ~ are independent univariate stationary Gaussian processes with an 
Exponential variogram with scale 1 and sill o? = C + Co = 50, where C is the partial 
sill and Co is the nugget. For Z2(s) and Z3(s) we assume C = 50 and Cy = 0 in all 
cases, while for Z,(s) their values vary (C = 50,30, 0) to change the proportion of the 
co-variability that has spatial structure; 

° a, b, cand d are constants whose values vary in order to obtain different correlation level 
and structure; 

e f(-) and g(-) are transformation functions, for which we consider two choices: Identity 
or Exponential; 

e kı and ky are adding constants to guarantee X (s) > 0 and Y (s) > 0. 
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Figure 1: Variables X (s) and Y (s) simulated under settings Al, A7 and B7. 


Overall, we present our results for 24 synthetic datasets which differ in the spatial distribu- 
tion of both the study and auxiliary variables, as well as in their relation. The complete list of 
settings used to generate the synthetic datasets is presented in Table 1. To give a better idea of 
the different relations between X, Y and s that can be simulated in our data, Figure 1 shows 
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Table 1: Root mean square error, with 500 replications and sample size = 1000 


Settings A: Zi, Z2, Z3 with C = 50, Co = 0 f(-) and g(-) = Identity 

Id a b c d Emp.corr. SRS  SpatLPM AuxLPM BivLPM UneqLPM SeqUneqLPM 
Al 0.6 0.6 1 1 0.298 0.265 0.093 0.257 0.131 0.130 0.113 
A2 0.385 1 1 0.385 0.350 0.248 0.088 0.221 0.128 0.111 0.102 
A3 1 0.385 0.385 1 0.361 0.242 0.085 0.244 0.115 0.143 0.103 
A4 0.82 0.82 1 1 0.424 0.295 0.103 0.250 0.149 0.138 0.119 
A5 1 1 1 1 0.515 0.324 0.113 0.314 0.151 0.131 0.119 
A6 1 1 0.82 0.82 0.607 0.297 0.104 0.228 0.113 0.110 0.106 
A7 1 1 0.6 0.6 0.738 0.269 0.095 0.178 0.095 0.078 0.082 
Settings B: Z, with C = 30, Co = 20 Z2, Z3 with C = 50, Co = 0 f(-) and g(-) = Exponential 

Id a b ¢ d Emp.corr. SRS  SpatLPM AuxLPM BivLPM  UneqLPM  SeqUneqLPM 
Bl 0.6 0.6 1 1 0.310 0.266 0.129 0.255 0.130 0.151 0.122 
B2 0.385 1 1 0.335 0.340 0.241 0.161 0.217 0.153 0.150 0.139 
B3 1 0.385 0.385 1 0.400 0.244 0.107 0.236 0.117 0.130 0.102 
B4 0.82 0.82 1 1 0.437 0.295 0.156 0.264 0.147 0.140 0.136 
B5 1 1 1 1 0.526 0.322 0.181 0.277 0.149 0.138 0.134 
B6 1 1 0.82 0.82 0.617 0.294 0.173 0.234 0.130 0.114 0.111 
B7 1 1 0.6 0.6 0.744 0.264 0.166 0.189 0.103 0.085 0.083 
Settings C: Zı with C = 0, Co = 50 Z2, Z3 with C = 50, Co = 0 f(-) and g(-) = Identity 

Id a b c d Emp.corr. SRS  SpatLPM AuxLPM BivLPM UneqLPM SeqUneqLPM 
C1 0.6 0.6 1 1 0.298 0.250 0.154 0.236 0.142 0.129 0.135 
C2 0.385 1 1 0.335 0.357 0.228 0.233 0.206 0.189 0.180 0.196 
C2 1 0.385 0.385 1 0.345 0.233 0.114 0.240 0.115 0.159 0.102 
C4 0.82 0.82 1 1 0.429 0.275 0.200 0.303 0.158 0.138 0.142 
c5 1 1 1 1 0.522 0.300 0.239 0.287 0.158 0.133 0.150 
C6 1 1 0.82 0.82 0.616 0.274 0.236 0.226 0.142 0.108 0.127 
C7 1 1 0.6 0.6 0.747 0.247 0.234 0.167 0.105 0.078 0.099 
Settings D: Zı with C = 30, Co = 20 Z2, Z3 with C = 50, Co = 0 f(-) and g(-) = Exponential 

Id a b c d Emp.corr. SRS  SpatLPM AuxLPM BivLPM UneqLPM SeqUneqLPM 
D1 0.12 0.1 0.08 0.09 0.508 0.038 0.022 0.033 0.018 0.015 0.015 
D2 01 0.1 0.05 0.05 0.723 0.030 0.018 0.021 0.013 0.007 0.010 
Settings E: Zı, Z2, Z3 with C = 50, Co = 0 f(-) and g(-) = Exponential 

Id a b c d Emp.corr. SRS  SpatLPM AuxLPM BivLPM UneqLPM SeqUneqLPM 
El 0.1 0.1 0.05 0.05 0.763 0.026 0.013 0.018 0.011 0.007 0.008 


variables X (s) and Y (s) generated under settings Al, A7 and B7: in scenario Al we observe a 
weak correlation between X and Y (equal to 0.298), with both variables strongly related with 
space; in both scenarios A7 and B7 the correlation between X and Y is stronger (more than 
0.7), but they differ with respect to the spatial structure of the data since in scenario B7 part of 
the co-variability (about 40%) is not spatially related. 

We choose to simulate scenarios with the different settings discussed above because when 
the analysis concerns a phenomenon measured at global scale it is common to observe different 
pattern between different areas of the globe and our aim is to find a strategy which could be 
globally applied by accounting for the various areas characteristics. 

Table 1 presents for each dataset the root mean square error (rmse) of the mean estimator for 
the sampling designs described above, in addition to the simple random sampling (SRS) which 
is included as a comparison. The results for the stratified designs are omitted for lack of space 
since they were in line with the other strategies but they were never the best. 
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Results confirm that, as expected, when we analyse spatial-related phenomena spreading 
the sample over the area of interest is always convenient: the SpatLPM strategy is always better 
than both SRS and AuxLPM. Nonetheless, the use of the auxiliary information can improve the 
efficiency of the estimates, in particular if it is used to calculate the inclusion probabilities in 
the unequal designs. 

It is important to note that, in order to evaluate when it is more or less convenient to use 
the auxiliary variable in addition to the geographical location, it is not enough to consider the 
correlation between X and Y: given the same level of correlation, estimates’ efficiency depends 
on the proportion of co-variability that has spatial structure. If the co-variability is all defined 
by a spatial structure (that is, when Z; has C = 50), the SpatLPM design (with equal selection 
probabilities) is enough; on the other hand when part (or all) of the co-variability in not spa- 
tially related (that is, when Z, has C = 30 or C = 0), the additional auxiliary variable improves 
the estimates’ efficiency, especially if used to define the unequal inclusion probabilities (Un- 
eqLPM). Moreover, UneqLPM performs better than SpatLPM even when the relation between 
X and Y is not linear (but still positive). 

Finally, SeqUneqLPM works better than UneqLPM when the performance of the latter is 
worse than that of SpatLPM but it does not always manage to reach or improve the performance 
of the UneqLPM when this is better than the SpatLPM. These last results are very preliminary, 
as the experiments are still ongoing. Investigation is required on the possibility to modify 
the sequential procedure in order to consider more phases in which to update the inclusion 
probability. Moreover, experiments with more additional explanatory variables are in plan. 
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